Abstract

In this paper, we show how the Ordered Decomposition DAGs (ODD) kernel
framework, which allows graph kernels to be defined from tree kernels,
makes it easy to define new state-of-the-art graph kernels. Here
we consider a fast graph kernel based on the Subtree kernel (ST), and we
propose various enhancements to increase its expressiveness. The proposed
DAG kernel has the same worst-case complexity as the one based on ST, but
an improved expressivity due to an augmented set of features. Moreover, we
propose a novel weighting scheme for the features, which can be applied to
other kernels of the ODD framework. These improvements allow the proposed
kernels to improve on the classification performance of the ST-based
kernel on several real-world datasets, reaching state-of-the-art performance.

1. Introduction

The increasing availability of data in structured form, such as trees [1] or
graphs [2, 3, 4], has led to the development of machine learning techniques
able to deal directly with such types of data. Among these, kernel methods,
such as Support Vector Machines (SVM) [5], have become very popular
due to their generalization ability and state-of-the-art performance in many
tasks, such as relationship extraction [6], analysis of RDF data [7], action
recognition [8], text categorization of biomedical data [9] and
bioinformatics [10].

The class of kernel methods comprises all those learning algorithms which
do not require an explicit representation of the inputs, but only information
about the similarities among them. A simple way of assessing the similarity
between two objects described by a set of features is to compute the dot
product of their representations in feature space. If a “similarity” function
$k(x_1, x_2)$, corresponding to a dot product
$\langle \phi(x_1), \phi(x_2) \rangle$ in feature space, is available,
the intermediate step of explicitly representing the data can be avoided.
In fact, computing $k(x_1, x_2)$ implicitly corresponds to performing a
nonlinear transformation of the input vectors $x_1$ and $x_2$ via a function
$\phi(\cdot)$ and then computing the dot product of the resulting vectors
$\phi(x_1)$ and $\phi(x_2)$. The function $\phi(\cdot)$ projects the input
vectors into a feature space of much higher (possibly infinite) dimension,
where it is more likely that the learning task can be accomplished. Kernel
methods generally formulate a learning problem as a constrained
optimization one, where an objective function combining an empirical risk
term with a (quadratic) regularizer must be minimized. If the employed
kernel function is symmetric positive semidefinite, the problem is convex and
thus has a global minimum [5].
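
As an illustration of the correspondence between a kernel and a dot product in feature space, the following minimal sketch compares the two for a homogeneous polynomial kernel of degree 2. The kernel, the feature map and the input vectors are chosen here only for illustration; they are assumptions of this example and are not taken from the paper.

```python
import itertools
import math


def poly_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel k(x, z) = <x, z>^2."""
    return sum(a * b for a, b in zip(x, z)) ** 2


def phi(x):
    """Explicit feature map of the kernel above: all ordered products x_i * x_j."""
    return [a * b for a, b in itertools.product(x, repeat=2)]


x1 = [1.0, 2.0, 3.0]
x2 = [0.5, -1.0, 2.0]

# Implicit computation: never builds the 9-dimensional feature vectors.
implicit = poly_kernel(x1, x2)

# Explicit computation: dot product of phi(x1) and phi(x2).
explicit = sum(a * b for a, b in zip(phi(x1), phi(x2)))
```

Both computations return the same value; the implicit one works in the 3-dimensional input space, while the explicit one needs the 9-dimensional feature vectors.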

Any kernel method can be decomposed into two modules: i) a problem-specific
kernel function; ii) a general-purpose learning algorithm (the solver).
Since the solver interfaces with the problem only by means of the kernel
function, it can be used with any kernel function, and vice versa. Examples
of popular kernel methods are the perceptron [11] for the on-line setting, and
Support Vector Machines [5] for the batch setting. Note that, provided
an appropriate kernel function is given, any kernel method can be applied
to any type of data. More importantly, the kernel function encodes all the
information about the input data, thus the definition of appropriate kernel
functions is crucial for the outcome of the learning algorithm.

A popular strategy for defining kernel functions for structured data is
to decompose the structures into their constituent parts, and then, for each
pair of parts, apply a local kernel [12]. While this strategy has proved
successful for strings and trees [13, 14, 15, 16, 17, 18], it is not directly
applicable to graphs because of the computational complexity issues which
arise: representing a graph in terms of its subgraphs is not feasible since
subgraph isomorphism, an NP-complete problem, would have to be solved for
each pair of subgraphs. In [19] it has been demonstrated that any kernel whose
feature space mapping is injective is as hard to compute as graph isomorphism,
a problem in NP which is not known to be either in P or NP-complete.
Due to this limitation, the available strategies for building kernels are:
i) restricting the input domain to a class of graphs for which isomorphism
can be checked quickly [20]; ii) selecting a priori a set of features, usually
corresponding to a specific type of substructure, such as walks [19],
paths [21, 22] or subtree patterns [23, 24]. The former approach can be
applied only to a limited type of graphs; the latter tends to have limited
flexibility: when the available kernels are not relevant to the task, a new
one has to be designed. However, defining an efficient symmetric positive
semidefinite kernel, corresponding to the desired feature space, can be an
extremely difficult task. All the above approaches discard information about
the original graph and are effective only when the selected features are
relevant for the current problem. We propose to design graph kernels as
follows: first transform the graphs into simpler structures, i.e. multisets
of directed acyclic graphs (DAGs), and then extend the definition of a large
class of already available kernels for trees to DAGs. Our approach allows the
application of the vast literature on kernels for trees, which consists of
fast and/or very expressive kernels, to the graph domain.

Generally speaking, a serious drawback which prevents many of the kernels
listed above from being applied to large datasets is their computational time
complexity. Those kernels which can be applied to large datasets exploit a
“limited” number of features to represent a graph. For example, the kernel
proposed in [24] has a linear complexity in the number of edges of the
graphs because any graph is represented in the feature space by a number of
non-zero features which is proportional to the number of nodes of the graph.
On the other hand, a too compact representation of a graph in feature space
may have a negative impact on the effectiveness of the kernel because of a
reduced discrimination ability.

In this paper, we tackle this problem by proposing various enhancements
to a fast graph kernel based on the Subtree kernel for trees (ST) [25]. Among
these, the main contribution is a novel tree kernel, which has the same
worst-case complexity as the ST kernel, while the size of its feature space
is much larger.

The paper is structured as follows. Section 2 introduces some basic
notation and definitions. Section 3 recalls the ODD framework, of which the
proposed kernels are instances. Section 4 describes the main contributions
of the paper: the ST+ kernel for DAGs and a novel weighting scheme for
the features, which can be applied to other kernels of the ODD framework.
Section 5 discusses some related kernels for graphs, and Section 6 provides
experimental evidence of the effectiveness of the proposed approaches.
Finally, Section 7 draws conclusions.

The paper extends the work in [26] by adding: i) a self-contained and
simplified description of the ST+ kernel; ii) a novel, more effective, feature
weighting scheme; iii) an extended and revised “Related Work” section; iv) a
novel set of experiments which are now performed on much larger benchmark
datasets and for a larger number of competing graph kernels; v) a comparison
among empirical execution times of the various experimented kernels.

2. Notation 

A graph is a triplet $G = (V, E, L)$, where $V$ (alternatively $V_G$) is the
set of nodes ($|V|$ is the number of nodes), $E$ the set of edges and $L()$ a
function returning the label of a node. All labels are obtained from a fixed
alphabet $A$. A graph is undirected if
$(v_i, v_j) \in E \Leftrightarrow (v_j, v_i) \in E$, otherwise
it is directed. A path in a graph is a sequence of nodes $v_1, \ldots, v_n$
such that $v_i \in V$, $1 \le i \le n$, $(v_i, v_{i+1}) \in E$ and
$\forall\, 1 < i < n,\ 1 < j < n,\ j \neq i:\ v_i \neq v_j$
(no node, except the first one, can appear twice in the same path). A cycle
is a path for which $v_1 = v_n$; a cycle is even/odd if its number of nodes is
even/odd, respectively. A graph is connected if there exists a path connecting
each pair of nodes. A connected graph is rooted if exactly one node has no
incoming edges. A graph is ordered if the set of neighbours of each node is
ordered. A tree is a rooted connected directed acyclic graph where each node
has at most one incoming edge. A subtree of a tree $T$ is a connected subset
of nodes of $T$. A proper subtree is a subtree composed of a node and all of
its descendants. Given a node $v$ of a tree, $\rho(v)$ represents the
outdegree of $v$, i.e. the number of nodes connected to $v$. We will use
$\rho$ as the maximum outdegree of a node in either a tree or a graph. The
depth $depth(v)$ of a node $v$ is the number of edges in the shortest path
between the root of the tree and $v$. If the tree is ordered, $ch_v[j]$
represents the $j$-th child of $v$ and $ch_v[j_1, \ldots, j_n]$ indicates the
set of children of $v$ with indices $j_1, j_2, \ldots, j_n$.
Given a graph $G$ and a node $v \in V(G)$, we define a subtree-walk of size
$h$ as the tree obtained by the following procedure: the root of the tree is
$v$; at each step $i$, with $1 \le i \le h$, and for each current leaf node
$v_j$ of the tree, any neighbouring node of $v_j$ in $G$ is added to the tree
as a child of $v_j$. Note that, when $h > 1$, typically a node of $G$ can
appear multiple times in the same subtree-walk. Given a DAG $D$ and a node
$v_i \in V(D)$, we define a tree-visit, denoted by $\overset{v_i}{\triangle}$,
as the tree resulting from the visit of $D$ starting from the node $v_i$.
Such a visit returns all the nodes of $D$ reachable from $v_i$. If a node
$v_j$ can be reached more than once, more occurrences of $v_j$ will appear in
$\overset{v_i}{\triangle}$ (see Figure 2-b for an example).
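
The notion of tree-visit can be made concrete with a short sketch. The representation below (a DAG as a dictionary from each node to the list of its children, a tree as nested tuples) and the small DAG used as input are assumptions made only for this illustration.

```python
def tree_visit(dag, labels, v):
    """Return the tree-visit rooted at v as nested tuples (label, children).

    A node reachable through several paths is visited once per path, so it
    appears several times in the resulting tree.
    """
    return (labels[v], tuple(tree_visit(dag, labels, c) for c in dag[v]))


def tree_size(tree):
    """Number of nodes of a tree built by tree_visit."""
    return 1 + sum(tree_size(child) for child in tree[1])


# Hypothetical DAG with 4 nodes: d is reachable from s through b and through e.
dag = {"s": ["b", "e"], "b": ["d"], "e": ["d"], "d": []}
labels = {node: node for node in dag}

visit_from_s = tree_visit(dag, labels, "s")
```

In this example the DAG has four nodes, while the tree-visit from `s` has five, because `d` occurs once under `b` and once under `e`.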

3. Preprocessing: from Graphs to Multisets of DAGs 

This section recalls the ODD-Kernels framework for graphs [27]. The 
idea of our approach is to transform the graphs into simpler structures, i.e. 
DAGs, and then apply a kernel for such structures. The following subsections 
explain each step of the transformation. 

3.1. From Graph to DAGs 

The graph $G$ is mapped into a multiset of DAGs
$DD_G = \{DD_G^{v_i} \mid v_i \in V_G\}$, where
$DD_G^{v_i} = (V_G^{v_i}, E_G^{v_i}, L)$ is obtained by keeping each edge in
the shortest path(s) connecting $v_i$ with any $v_j \in V_G$. From a practical
point of view, $DD_G^{v_i}$ can be built by performing a breadth-first visit
on the graph $G$ starting from node $v_i$ and applying the following rules:

1. during the visit a direction is given to each edge; if $v_j$ is reached
from $v_i$ in one step, then $(v_i, v_j) \in E_G^{v_i}$ (note that edge
$(v_j, v_i)$ is not added to $E_G^{v_i}$);

2. edges connecting nodes reached at level $l$ of the visit to nodes reached
at level $g < l$ are not added to $E_G^{v_i}$ (such edges would induce a
cycle in $DD_G^{v_i}$).

For every choice of $G$ and $v_i$, a single Decompositional DAG $DD_G^{v_i}$
is generated. By repeating the procedure for each node of the graph, $|V|$
DAGs are obtained. Figure 1 shows the four DDs obtained from the undirected
graph in Figure 1-a. Note that when the same node is reached simultaneously
(at the same level of the visit) from different nodes, then all involved
edges are preserved. For example, when considering the visit at level 2
starting from node s, the node d is reached simultaneously by edges (b, d)
and (e, d), and both of them are preserved in the corresponding
Decompositional DAG (see Figure 1-b). In order to reduce the total number of
nodes of $DD_G$, we propose to limit the depth of the visits during the
generation of the multiset of DAGs [27] to a constant value $h$. The
resulting DAG will be referred to as $DD_G^{v_i,h}$. Given $v \in V_G$, let
$H$ be the number of nodes generated by the visits up to depth $h$. An upper
bound for $H$ is $\rho^h$. Notice that this is a loose bound in many
practical cases. The total number of nodes of $DD_G$ is $|V_G| H$. Note
that, if $\rho$ is constant, then also $H$ is constant.

Figure 1: Example of decomposition of a graph a) into its 4 DDs b-e).
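
The decomposition described in this subsection can be sketched as a breadth-first visit. This is only a minimal illustration under stated assumptions: the adjacency-list representation, the function name `decompositional_dag` and the example graph (which is not the one of Figure 1) are all hypothetical.

```python
from collections import deque


def decompositional_dag(adj, root, h=None):
    """Build the DAG of shortest paths from `root` with a breadth-first visit.

    adj:  dict mapping each node to the list of its neighbours (undirected graph).
    h:    optional maximum depth of the visit (None means no limit).
    Returns a dict mapping each reached node to the list of its children.
    """
    level = {root: 0}
    dag = {root: []}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if h is not None and level[u] == h:
            continue  # depth limit reached: u is not expanded
        for w in adj[u]:
            if w not in level:  # w is reached for the first time
                level[w] = level[u] + 1
                dag[w] = []
                queue.append(w)
            if level[w] == level[u] + 1:
                # The edge lies on a shortest path from the root: keep it,
                # directed from u to w. Edges towards the same or a lower
                # level are discarded.
                dag[u].append(w)
    return dag


# Hypothetical undirected graph: a square s-b-d-e plus the edge b-e.
adj = {
    "s": ["b", "e"],
    "b": ["s", "e", "d"],
    "e": ["s", "b", "d"],
    "d": ["b", "e"],
}

dd_s = decompositional_dag(adj, "s")
dd_s_h1 = decompositional_dag(adj, "s", h=1)
all_dds = {v: decompositional_dag(adj, v) for v in adj}  # one DAG per node
```

In the DAG rooted at `s`, node `d` keeps both incoming edges (from `b` and from `e`), while the edge between `b` and `e`, which connects two nodes of the same level, is discarded.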


3.2. Ordering DAG nodes 

The kernels we define in the following, which are all straightforwardly
derived from tree kernels, require DAG nodes to be ordered. Therefore,
we define a strict partial order $\dot{<}$ between DAG nodes in $DD_G^{v_i}$,
obtaining Ordered DAGs $ODD_G^{v_i}$. The ordering makes use of a unique
representation of subtrees as strings inspired by [14]. Here we modify such
mapping by employing perfect hash functions, i.e. hash functions which
guarantee to have no collisions, to encode subtrees [24, 28]. Let $\kappa()$
be a perfect hash function, $\lceil$, $\rfloor$ and $\#$ be symbols never
appearing in any node label and $ch_v[j]$ the $j$-th node in the ordered
sequence of outgoing edges of $v$, then

$$
\pi(\overset{v}{\triangle}) =
\begin{cases}
\kappa(L(v)) & \text{if } v \text{ is a leaf node} \\
\kappa\big(L(v) \lceil \pi(\overset{ch_v[1]}{\triangle}) \# \pi(\overset{ch_v[2]}{\triangle}) \# \ldots \# \pi(\overset{ch_v[\rho(v)]}{\triangle}) \rfloor\big) & \text{otherwise}
\end{cases}
\qquad (1)
$$

where the children of $v$ are recursively ordered according to their $\pi()$
values. To simplify notation, in the following, when it is clear from the
context, we will use the notation $\pi(v)$ instead of
$\pi(\overset{v}{\triangle})$. Then $v_i \dot{<} v_j$ if
$\pi(v_i) < \pi(v_j)$, where $<$ is the order relation between alphanumeric
strings. Notice that
$\pi(v_i) = \pi(v_j) \Leftrightarrow \neg(v_i \dot{<} v_j) \wedge \neg(v_j \dot{<} v_i)$,
i.e. $\pi(v_i) = \pi(v_j)$ if and only if the nodes $v_i$ and $v_j$ are not
comparable. In such a case, many orderings for non-comparable children nodes
according to $\dot{<}$ are possible. We are now going to prove some results
that will make it easier to show, in Section 4, that each kernel described in
this paper (as well as a large class of kernels for trees) yields the same
features, independently of the ordering of non-comparable nodes. Since all
the features of the kernels in Section 4 are extracted from tree visits of
DAG nodes, our goal here is to show that isomorphic DAGs yield the same tree
visits. We first show that if two DAGs $DD_{G_1}^{v_i}$ and $DD_{G_2}^{v_j}$
are isomorphic, then the root nodes of the DAGs are not comparable with
respect to the ordering $\dot{<}$, in fact:

Theorem 3.1. If two DAGs $DD_{G_1}^{v_i}$ and $DD_{G_2}^{v_j}$ are isomorphic,
then $\neg(v_i \dot{<} v_j) \wedge \neg(v_j \dot{<} v_i)$.

Proof. Let $f : V_{G_1} \to V_{G_2}$ be an isomorphism between
$DD_{G_1}^{v_i}$ and $DD_{G_2}^{v_j}$. We prove the thesis by induction. Let
$f(v_i) = v_j$; since the nodes are isomorphic, $L(v_i) = L(v_j)$. If $v_i$
and $v_j$ are leaf nodes, then $\pi(v_i) = \pi(v_j)$ and consequently
$\neg(v_i \dot{<} v_j) \wedge \neg(v_j \dot{<} v_i)$. Otherwise, by inductive
hypothesis $\forall l.\ 1 \le l \le \rho(v_i).\ \pi(ch_{v_i}[l]) = \pi(ch_{f(v_i)}[l])$
and $L(v_i) = L(f(v_i))$, thus $\pi(v_i) = \pi(f(v_i)) = \pi(v_j)$.

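The following theorem shows that two non-comparable nodes $v_i, v_j$ yield
identical tree visits $\overset{v_i}{\triangle}$, $\overset{v_j}{\triangle}$:

Theorem 3.2. Given the ordering $\dot{<}$,
$\neg(v_i \dot{<} v_j) \wedge \neg(v_j \dot{<} v_i)$ if and only if
$\overset{v_i}{\triangle}$ and $\overset{v_j}{\triangle}$ are identical.

Proof. If $\neg(v_i \dot{<} v_j) \wedge \neg(v_j \dot{<} v_i)$ then
$\pi(v_i) = \pi(v_j)$. Recalling that $\kappa()$, the function on which
$\pi()$ is based, is a perfect hash function, we prove the thesis by
induction. If $v_i, v_j$ are leaf nodes, then
$\pi(v_i) = \pi(v_j) \Leftrightarrow L(v_i) = L(v_j)$. If $v_i, v_j$ are not
leaf nodes, then $\forall l.\ 1 \le l \le \rho(v_i)$,
$\overset{ch_{v_i}[l]}{\triangle}$ is identical to
$\overset{ch_{v_j}[l]}{\triangle}$ by inductive hypothesis, and then it must
be $L(v_i) = L(v_j)$ since $\pi(v_i) = \pi(v_j)$; therefore
$\overset{v_i}{\triangle}$ is identical to $\overset{v_j}{\triangle}$. Now we
show by induction that if $\overset{v_i}{\triangle}$ is identical to
$\overset{v_j}{\triangle}$, then $\pi(v_i) = \pi(v_j)$. The base case has
already been proved by the equivalence
$\pi(v_i) = \pi(v_j) \Leftrightarrow L(v_i) = L(v_j)$. By inductive
hypothesis $\pi(ch_{v_i}[m]) = \pi(ch_{v_j}[m])$ for each child $m$ of $v_i$
and $v_j$. Then $\pi(v_i) = \pi(v_j)$ and
$\neg(v_i \dot{<} v_j) \wedge \neg(v_j \dot{<} v_i)$.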


Note that, since any ordering between non-comparable vertices is equivalent
for our goals, we avoid giving a specific ordering. If the $\pi()$ values
are computed according to a post-order visit of the DAG, then the values
$\pi(ch_v[l])$ for $1 \le l \le \rho(v)$ are already available when computing
$\pi(v)$. Thus the time complexity of the ordering phase of the DAG is
$O(|V_D| \rho \log \rho)$, where the term $\rho \log \rho$ accounts for the
ordering of the children of each node.
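
The computation of the $\pi()$ values by a post-order visit can be sketched as follows. This is a simplified illustration, not the authors' implementation: the canonical string itself is used in place of a perfect hash function (the identity on strings is trivially collision-free), the ASCII characters `[`, `]` and `#` stand in for the reserved symbols, and the DAG and its labels are hypothetical.

```python
def pi_values(dag, labels):
    """Compute a canonical string pi(v) for every node of a DAG.

    Children are visited before their parent (post-order) and their codes
    are sorted, so pi(v) does not depend on the order in which the children
    are listed. Labels are assumed not to contain '[', ']' or '#'.
    """
    pi = {}

    def visit(v):
        if v not in pi:
            child_codes = sorted(visit(c) for c in dag[v])
            if child_codes:
                pi[v] = labels[v] + "[" + "#".join(child_codes) + "]"
            else:
                pi[v] = labels[v]
        return pi[v]

    for v in dag:
        visit(v)
    return pi


# Hypothetical DAG: nodes b and e carry the same label and have the same child.
dag = {"s": ["b", "e"], "b": ["d"], "e": ["d"], "d": []}
labels = {"s": "S", "b": "B", "e": "B", "d": "D"}

pi = pi_values(dag, labels)
ordered_children_of_s = sorted(dag["s"], key=pi.get)
```

Here `b` and `e` receive the same code, i.e. they are not comparable, and either order of the two yields the same tree visit from `s`.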

3.3. Applying Tree Kernels to DAGs

If we restrict to the kernels which are going to be presented in this paper,
the general formula for graph kernels derived from the ODD framework [27]
can be simplified as follows

$$
ODD_K(G_1, G_2) = \sum_{\substack{D_1 \in ODD_{G_1} \\ D_2 \in ODD_{G_2}}} \langle \phi^K(D_1), \phi^K(D_2) \rangle, \qquad (2)
$$

where $\langle \cdot, \cdot \rangle$ is the dot product operator,
$\phi^K(D)$ is the explicit feature space projection of the DAG $D$ with
respect to the kernel $K$ and $ODD_G = \{ODD_G^{v} \mid v \in V_G\}$.
Section 4.1 gives an example of an instance of the kernel defined in (2).
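
The structure of (2) can be illustrated with sparse feature vectors. In the sketch below the feature map is a deliberately trivial placeholder (it counts node labels) and each DAG is reduced to the list of its labels; both are assumptions of this example. It also shows that, by bilinearity of the dot product, the double sum over pairs of DAGs equals a single dot product between the summed feature vectors of the two graphs.

```python
from collections import Counter


def dot(phi1, phi2):
    """Dot product between two sparse feature vectors stored as mappings."""
    return sum(weight * phi2[f] for f, weight in phi1.items() if f in phi2)


def odd_kernel(dags1, dags2, phi):
    """Direct evaluation of (2): sum of dot products over all pairs of DAGs."""
    return sum(dot(phi(d1), phi(d2)) for d1 in dags1 for d2 in dags2)


def odd_kernel_summed(dags1, dags2, phi):
    """Same value, obtained from one summed feature vector per graph."""
    total1, total2 = Counter(), Counter()
    for d in dags1:
        total1.update(phi(d))
    for d in dags2:
        total2.update(phi(d))
    return dot(total1, total2)


# Placeholder data: each "DAG" is just the list of its node labels, and the
# placeholder feature map counts how often each label occurs.
dags_g1 = [["a", "b"], ["a"]]
dags_g2 = [["a", "a"], ["b", "c"]]

k_pairs = odd_kernel(dags_g1, dags_g2, Counter)
k_summed = odd_kernel_summed(dags_g1, dags_g2, Counter)
```

The second form needs one pass over the DAGs of each graph instead of one dot product per pair.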

4. Kernels for DAGs 

In Section 3, we showed a preprocessing procedure for transforming a
graph into a multiset of ordered DAGs. In this section, we first recall the
$ODD_{ST_h}$ kernel, presenting it in a slightly different way from how it
was originally introduced in [27]. Then, we describe the original
contributions of the paper, i.e. a novel kernel for DAGs, named ST+, and a
novel weighting scheme for the features which is specifically designed for our
setting.

4.1. ST kernel for DAGs

Let us consider the tree $\overset{v}{\triangle}$ resulting from the visit of
$ODD_G^{v}$ starting from the root node $v$. The visit can be stopped when the
tree $\overset{v}{\triangle}$ reaches a maximum depth $h$. Such a tree is
referred to as $\overset{v}{\triangle}_h$.

As an example of the kernel in (2), we recall the $ODD_{ST_h}$ kernel [27].
The features of the kernel are $\overset{v}{\triangle}_l$ for each
$v \in V_D$, where $D \in ODD_G$ as defined in the previous section, and for
each $0 \le l \le h$. Specifically, any node $v$ of the DAG contributes to
the feature vector $\phi(\cdot)$ as $\phi_{\pi(v)} = \lambda^{size/2}$, where
$size = |\overset{v}{\triangle}_l|$ for some $l$, and $\pi(v)$ (we recall
that this notation stands for $\pi(\overset{v}{\triangle}_l)$) is the
function defined by (1). This weighting scheme for the features is inherited
from the ST kernel [25] and is motivated by the fact that, when computing a
kernel involving two matching large trees, the value returned by the kernel
is very large because not only the whole trees will match, but all their
subtrees will match as well. To correct that, the contribution to the
kernel of a matching tree is down-weighted by $\lambda^{size}$, where
$0 < \lambda \le 1$.
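
As a small worked example of this down-weighting, consider a single pair of identical tree visits, each being a chain of three nodes a → b → c. This is only a sketch under the assumption that a matching subtree of size $s$ contributes $\lambda^{s}$ to the kernel. The matching proper subtrees are rooted at a, b and c and have sizes 3, 2 and 1, so that

```latex
K = \lambda^{3} + \lambda^{2} + \lambda,
\qquad
\lambda = 1 \;\Rightarrow\; K = 3,
\qquad
\lambda = 0.5 \;\Rightarrow\; K = 0.125 + 0.25 + 0.5 = 0.875 .
```

With $\lambda = 1$ every matching subtree counts the same, whereas with $\lambda = 0.5$ the largest subtree contributes the least.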

In order to demonstrate that the resulting graph kernel is positive
semidefinite, we need to prove that our $\phi(\cdot)$ function is
well-defined, i.e. it gives the same result when the representation of the
input is changed without changing the value of the input. If two graphs are
isomorphic, they generate the same multiset of DAGs (since they are defined
over shortest paths). We know from Theorem 3.1 that isomorphic DAGs generate
the same visits. Since the features considered by the ST kernel are subtrees,
it directly follows from Theorem 3.2 that swapping non-comparable vertices in
the ordering does not affect the feature space representation of a graph.
Thus, we provided a well-defined feature space representation for
$ODD_{ST_h}$, from which it follows that the kernel is positive semidefinite.

4.2. The ST+ Kernel for DAGs

The kernel we introduce in this section enlarges the feature space of the
ST kernel, with a modest increase in computational burden, and is referred
to as ST+. In Algorithm 1 we define a procedure to compute the explicit
feature space representation $\phi(\cdot)$ of ST+. Note that this procedure
accesses the graph only by means of $\overset{v}{\triangle}$ and
$\overset{v}{\triangle}_l$; moreover, if two trees $\overset{v_i}{\triangle}$
and $\overset{v_j}{\triangle}$ are identical, then all their subtrees are
identical as well. Thus, if two nodes generate the same
$\pi(v_i) = \pi(v_j)$, then
$\overset{v_i}{\triangle} = \overset{v_j}{\triangle}$ and
$\overset{ch_m[v_i]}{\triangle}_l = \overset{ch_m[v_j]}{\triangle}_l$ for
each $m$ and $l$. Thus, by Theorem 3.2 the procedure is well defined also in
the presence of non-comparable nodes, since the resulting tree visits are the
same. This proves that the kernel is positive semidefinite. The set of
features related to the ST+ kernel is a superset of the features of ST and a
subset of the features of PT [15]. Line 8 of Algorithm 1 depicts a generic
feature introduced by ST+. Given a node $v$ and an index $j$, the feature is
defined as the subtree $\overset{v}{\triangle}$ where all subtrees rooted at
children of $v$, except for the $j$-th child, are replaced by a corresponding
limited visit of $l$ levels. Notice that the feature actually depends on
$v \in V_D$, the index of a child $j$ and a limit $l$ on the depth of the
visits. The function $\pi(f)$ returns the index of the feature $f$ in
$\phi(\cdot)$. Figure 2 depicts a partial feature space representation of a
DAG according to ST+. While for the ST kernel there is one feature for each
$v \in V_D$, ST+ associates at most $(\rho(v) \cdot h) + 1$ features to any
$v \in V_D$. For each node $v \in V_D$, for example the node with label v
highlighted in Figure 2-a, the algorithm inserts the following features:
Algorithm 1 A procedure for computing the features of the ST+ kernel.

 1: Input: an ordered DAG $D$, the maximum depth of the visit $h$
 2: for each $v \in V_D$ do
 3:   $f = \overset{v}{\triangle}$
 4:   $\phi_{\pi(f)} = \phi_{\pi(f)} + \lambda^{|f|/2}$  // the proper subtree rooted at $v$ as a feature.
 5:   // if the feature is first encountered, it is assumed $\phi_{\pi(f)} = 0$
 6:   for $0 \le l < \min(h, depth(f))$ do
 7:     for $1 \le j \le \rho(v)$ do
 8:       $f' = v\big(\overset{ch_1[v]}{\triangle}_l, \ldots, \overset{ch_j[v]}{\triangle}, \ldots, \overset{ch_{\rho(v)}[v]}{\triangle}_l\big)$
 9:       $\phi_{\pi(f')} = \phi_{\pi(f')} + \lambda^{|f'|/2}$  // add the subtree $f'$ as a feature.
10:     end for
11:   end for
12: end for
13: Output: $\phi(\cdot)$, the set of features of $D$



1. the proper subtree rooted at $v$, which in our example is the one in
Figure 2-b;

2. given $ch_j[v]$, the subtree composed of:

• $v$;

• the proper subtree rooted at the $j$-th child of $v$;

• the subtrees resulting from a visit limited to $0 \le l < h$ levels
starting from the other children of $v$

is added as a feature. As $l$ ranges from 0 to $h$, the features/subtrees
from Figure 2-c to Figure 2-e are added.
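
The feature extraction just described can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation, and it makes several assumptions: trees are encoded as canonical strings instead of perfect-hash values, a visit limited to `l` levels keeps the nodes at distance at most `l` from its starting node, `l` ranges over `0 <= l < min(h, depth)`, each feature is weighted by `lam ** (size / 2)`, and a feature that is generated more than once is simply accumulated. The DAG used as input is hypothetical.

```python
from collections import defaultdict


def build(label, kids):
    """Encode a tree from its root label and the (code, size) pairs of its children."""
    kids = sorted(kids)  # canonical order of the children
    code = label if not kids else label + "[" + "#".join(c for c, _ in kids) + "]"
    return code, 1 + sum(size for _, size in kids)


def visit(dag, labels, v, limit=None):
    """Encode the tree visit from v; `limit` caps the number of levels below v."""
    if limit == 0:
        return build(labels[v], [])
    nxt = None if limit is None else limit - 1
    return build(labels[v], [visit(dag, labels, c, nxt) for c in dag[v]])


def depth(dag, v):
    """Number of edges on the longest downward path starting from v."""
    return 1 + max(depth(dag, c) for c in dag[v]) if dag[v] else 0


def st_plus_features(dag, labels, h, lam):
    """Sparse feature vector of one DAG: ST features plus the additional ones."""
    phi = defaultdict(float)
    for v in dag:
        code, size = visit(dag, labels, v)  # proper subtree rooted at v
        phi[code] += lam ** (size / 2)
        for l in range(min(h, depth(dag, v))):
            for j in range(len(dag[v])):
                # Child j keeps its full visit, the other children are cut at l levels.
                kids = [
                    visit(dag, labels, c) if i == j else visit(dag, labels, c, l)
                    for i, c in enumerate(dag[v])
                ]
                code, size = build(labels[v], kids)
                phi[code] += lam ** (size / 2)
    return dict(phi)


# Hypothetical DAG (a tree, for simplicity): v has children x and y.
dag = {"v": ["x", "y"], "x": ["a"], "y": ["b"], "a": [], "b": ["c"], "c": []}
labels = {node: node for node in dag}

features = st_plus_features(dag, labels, h=2, lam=1.0)
st_only = {visit(dag, labels, node)[0] for node in dag}
```

For the node `v` of this example, the additional features are `v[x[a]#y]`, `v[x#y[b[c]]]` and `v[x[a]#y[b]]`; the six ST features (one per node) are all contained in the result.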

Recalling that H is the number of nodes in a DAG ODD^, the complexity 
of Algorithm 1 is 0{Hh‘^p^\ogp). The complexity of the ODD kernel in 
(2), instantiated with ST+ as base kernel, is $O(|V_G| \log |V_G|)$, assuming
$\rho$ constant.

4.3. A Novel Feature Weighting Scheme

The features associated with many kernels for graphs, including $ODD_{ST_h}$
and $ODD_{ST+}$, are not independent from each other. They are, instead,
organized in a hierarchical structure [29]. Let us consider the $ODD_{ST_h}$
kernel as an example: given any pair $t_i, t$ such that $t_i$ is a subtree of
$t$, if $t$ occurs as a feature for a graph $G$, then $t_i$ must occur as a
feature as well. As a consequence, sticking to our example, there is a
monotonically increasing relationship between the frequencies of the subtree
features $t_i$ and the subtree features $t$. Such relationship is quantified
in the upper-left plot of Figure 3, which reports the frequencies of the
features generated by the $ODD_{ST_h}$ kernel, for $h \in \{0, \ldots, 3\}$,
on one of the datasets we will consider in Section 6. The points on the
x-axis correspond to features, sorted according to their weights. The
y-axis, since $\lambda = 1$, reports the frequencies of the features in the
dataset, i.e. the number of times each feature appears in all input graphs.
Note that the x-axis is in logarithmic scale. The frequencies are distributed
according to a Zipfian distribution, which means that there are very few
features with high frequency. Given the structured nature of the feature
space, such features are the “simple” ones, i.e. those associated with small
sized subtrees, for example single nodes. Any kernel function evaluation will
then be highly influenced by such features, which are typically the least
discriminative ones. In the case of the $ODD_{ST_h}$ and $ODD_{ST+}$ kernels,
which we recall first decompose the graph into a set of DAGs, the difference
between the frequencies of small-sized and large-sized features is even
greater since they are extracted from multiple DAGs: the smaller the size of
a subtree, the more likely it is to appear in multiple DAGs. The fact that
the distribution of the weights of the features is particularly skewed may
negatively impact the predictive performance of the kernel since, in
principle, we would like to give more emphasis to (i.e. to weight more)
bigger, discriminative features with respect to small ones, which tend to
appear in almost all examples and thus are generally not correlated with the
target concept.

Figure 2: Feature space representation related to the kernel ST+: a) an
input DAG; b) the proper subtree rooted at the node labelled as v; c)-e)
given the child x of v, the features related to visits limited to $l$ levels.

Figure 3: Comparison between the weighting schemes $w_G(f)$ (3) and
$wt_G(f)$ (4). On the x axis, in a logarithmic scale, the first 100 features
generated by the $ODD_{ST_h}$ kernel for different $h$ values. The y axis
reports the cumulative weight of each feature among all the graphs in the
dataset.

One way to tackle this issue is to adopt the weighting scheme explained
in Section 4.1, which has been designed specifically for the computation of
tree kernels [25]. This scheme has been implemented in the original
$ODD_{ST_h}$ kernel formulation, and we maintained it for the proposed
$ODD_{ST+}$ kernel; given a graph $G$, the weight $w_G(f)$ of each feature
$f$ (see lines 4 and 9 of Algorithm 1) is computed as

$$
w_G(f) = freq_G(f) \cdot \lambda^{|f|/2}, \qquad (3)
$$

where $freq_G(f)$ is the frequency of the feature $f$ in $G$. Therefore the
contribution to the kernel of the same matching feature (computed via dot
product) in two input graphs $G_1$ and $G_2$ is
$freq_{G_1}(f) \cdot freq_{G_2}(f) \cdot \lambda^{|f|}$. A value of $\lambda$
greater than 1 would give more importance to large matching trees. However,
the contribution of the less frequent, possibly interesting, small features
could be underweighted. The upper-right plot in Figure 3 shows that, with
this weighting approach, there are slightly more features with a relatively
high weight w.r.t. the case where no weighting scheme is applied (i.e. when
$\lambda = 1$). Nonetheless, the distribution is still very skewed.

Another possibility is to define a different weighting scheme, more suited
to our approach. As a first step in this direction, we propose to mitigate the
contribution of otherwise overweighted features with a different definition¹
of $w_G(f)$, in the following denoted as $wt_G(f)$, i.e.

$$
wt_G(f) = \tanh(\lambda^{|f|}) \cdot \tanh(freq_G(f)), \qquad (4)
$$

where $\tanh(\cdot)$ is the hyperbolic tangent function. Note that the
original weighting scheme depends nonlinearly (exponentially) on the size of
the feature $|f|$ and linearly on its frequency. The novel scheme we are
proposing, on the other hand, depends nonlinearly on both $|f|$ and
$freq_G(f)$. This way, the contribution of each feature is smoothly and
non-linearly normalized in the interval [0,1].

Note that the hyperbolic tangent function is almost linear around zero, and
asymptotically tends to one for positive values. This means that the
contribution of frequent features is truncated, while the less frequent
features are still discriminated since they fall in the linear part of the
function. The same is true for the $\lambda^{|f|}$ factor.
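
The different behaviour of the two weighting schemes can be checked numerically. The sketch below assumes that the size-dependent factor is $\lambda^{|f|/2}$ in (3) and $\lambda^{|f|}$ in (4); the function names and the frequencies used are illustrative only.

```python
import math


def w(freq, size, lam):
    """Weight of a feature under scheme (3): linear in the frequency."""
    return freq * lam ** (size / 2)


def wt(freq, size, lam):
    """Weight of a feature under scheme (4): both factors squashed by tanh."""
    return math.tanh(lam ** size) * math.tanh(freq)


lam = 1.0
# A very frequent small feature against a rare one of the same size.
ratio_w = w(1000, 1, lam) / w(1, 1, lam)
ratio_wt = wt(1000, 1, lam) / wt(1, 1, lam)
```

Under (3) the frequent feature weighs a thousand times more than the rare one, whereas under (4) the ratio stays below 2 because the contribution of the frequency saturates.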

The lower plots in Figure 3 report the weight distribution according to
the new $wt_G$ weighting function proposed in (4) with $\lambda = 1$ and
$\lambda = 1.8$, respectively. The final result is that the weights are
distributed in a smoother way.

The new weighting scheme is applied to the ST kernel, obtaining a variant
of the kernel proposed in [27], and to the ST+ kernel first proposed in this
paper. Note that this novel weighting scheme is just one possibility among
several. The key point is that we want to achieve a smoother distribution
of the weights associated with the features. The tanh function implements all
our desiderata, but any other sigmoidal function can be adopted.
Notwithstanding the heuristic nature of our choice, the experimental results
we have obtained on several real-world datasets, as reported in Section 6,
show that the novel weighting scheme makes it possible to reach statistically
significant improvements over state-of-the-art kernels. This seems to confirm
that both our intuition on the smoothness of the weight distribution and its
implementation via the tanh function are useful.

¹ This is an evolution of the scheme proposed in [26].

5. Related work 

Graph data is usually high-dimensional. For this reason, in order to 
perform learning on graph datasets, there are two possible approaches: 

1. applying a preprocessing phase aimed at selecting possibly relevant 
features; 

2. in the context of kernel methods, using tractable kernel functions. 

Generally speaking, the methods following the first approach extract
frequent patterns, build a vectorial representation of the graphs according to
such patterns and then apply a kernel method. When the kernel method is
an SVM, the approach is referred to as SVM with frequent pattern mining
(freqSVM). The techniques for extracting the features include Gaston [30],
Correlated Pattern Mining (CPM) [31] and MOLFEA [32]. Saigo et al. [33]
proposed gBoost, a boosting method that progressively collects informative
(according to the target output) patterns.

The second approach includes a set of kernel functions for graphs. The
Marginalized Graph Kernel (MGK) considers common walks as features [34]
(the work has been extended in order to make it more efficient and effective
in [35]). Informally, this kernel is defined as the expected value of a kernel
over all possible pairs of label sequences generated by random walks on two
graphs. The worst case time complexity of the algorithm presented in [36] is 
OdGcp). 

The Shortest Path Kernel associates a feature to each pair of nodes of one
graph. The value of the feature is the length of the shortest path between
the corresponding nodes in the graph [37]. The complexity of the kernel is
$O(|V|^4)$. Since the Shortest Path Kernel is based on paths, it can be
represented as an instance of (2). We do not report experimental results
about this kernel because of its high computational complexity, and its
inferior results compared to other state-of-the-art kernels on many of the
datasets considered in this paper [24, 38].
An effective method for computing path-based kernels is described in [39].
First, a graph is decomposed into a set of trees with $t$ nodes in total. Then
the Burrows-Wheeler transform is employed for fast and space-efficient
enumeration of paths. The complexity of the kernel is
$O(t \log t^{\epsilon})$, with $\epsilon < 1$.
The graphlet kernel [40] counts all types of matching subgraphs of small size
$k$ (e.g. $k = 3, 4$ or $5$). There are efficient schemes for computing this
kernel, but they are applicable only to unlabeled graphs. For the labeled
case, the computational complexity of this kernel is $O(n^k)$. In the
experimental section of this paper, we considered the Graphlet kernel
instantiated with $k = 3$, which will be referred to as the 3-Graphlet kernel.

The Weisfeiler-Lehman Fast Subtree kernel (FS) counts the number of
identical subtree patterns obtained by subtree-walks up to height $h$ [24, 38].
The complexity of the kernel is $O(|E|h)$. While fast to compute, the kernel
may lack expressiveness for some tasks, given that the number of non-zero
features generated by one graph is at most $|V|h$. Note that the subtree-walks
extracted by the kernel differ from the tree structures extracted by the
kernels proposed in Section 4: in FS a node usually appears multiple times in
the same subtree-walk, while in the ODD kernel only DAG nodes with multiple
incoming edges appear multiple times in the extracted tree structures. This
difference makes the Weisfeiler-Lehman Fast Subtree kernel not reproducible
from (2); a discussion of the differences between the feature spaces induced
by the Weisfeiler-Lehman Fast Subtree and the $ODD_{ST_h}$ kernels can be
found in [27]. Moreover, the features of the FS kernel are subtree-walks,
while specific features (as explained in Sections 4.1 and 4.2) are extracted
from the tree-visits obtained from the $ODD_{ST_h}$ and $ODD_{ST+}$ kernels.
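
To make the subtree-walk features of the FS kernel concrete, the following
Python sketch performs Weisfeiler-Lehman label refinement on small labeled
graphs. It is a minimal illustration under stated assumptions: the function
names, the shared `table` dictionary used to compress signatures into integer
ids, and the toy graphs are all hypothetical, and the sketch also counts the
initial node labels as features.

```python
from collections import Counter


def wl_subtree_features(adj, labels, h, table):
    """Count compressed WL labels over h refinement rounds.

    adj: {node: [neighbours]}; labels: {node: label};
    table: dict shared by all graphs, mapping a signature to an integer id.
    """
    current = {v: table.setdefault(("init", labels[v]), len(table)) for v in adj}
    features = Counter(current.values())
    for _ in range(h):
        refined = {}
        for v in adj:
            # The new label encodes the old label and the multiset of
            # neighbour labels, so a node can occur many times in a pattern.
            signature = (current[v], tuple(sorted(current[u] for u in adj[v])))
            refined[v] = table.setdefault(signature, len(table))
        current = refined
        features.update(current.values())
    return features


def sparse_dot(f, g):
    """Kernel value: dot product of two sparse feature-count maps."""
    return sum(count * g[key] for key, count in f.items() if key in g)


table = {}
g1 = ({0: [1], 1: [0, 2], 2: [1]}, {0: "C", 1: "C", 2: "O"})
g2 = ({0: [1], 1: [0]}, {0: "C", 1: "C"})
f1 = wl_subtree_features(*g1, h=1, table=table)
f2 = wl_subtree_features(*g2, h=1, table=table)
print(sparse_dot(f1, f1), sparse_dot(f1, f2))
```

Each refinement round assigns exactly one label to every node, so a round can
add at most $|V|$ distinct features for a graph, which is why the feature
space of this kernel stays comparatively small.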

Costa and De Grave [21] extended the Fast Subtree kernel by computing exact
matches between pairs of subgraphs with controlled size and distance. Their
kernel, named Neighborhood Subgraph Pairwise Distance Kernel (NSPDK), has
$O(|V| |V_h| |E_h| \log |E_h|)$ time complexity, where $|V_h|$ and $|E_h|$ are
the number of nodes and the number of edges of the subgraph obtained by a
breadth-first visit of depth $h$. The authors state that, for small values of
the subgraph size and distance, the complexity of the kernel becomes linear
in practice.

The Weisfeiler-Lehman Shortest Path kernel proposed in [38] is similar in
spirit to the NSPDK kernel. Indeed, it considers pairs of subtree patterns and
their distance. However, it does not limit the maximum distance between the
considered patterns, resulting in a computational complexity of $O(n^4)$.

Kernel                   Complexity
RW [34]                  $O(|V|^3)$
SP [37]                  $O(|V|^4)$
WL-SP [38]               $O(|V|^4)$
3-Graphlet [40]          $O(|V|^3)$
Treelet [41]             $O(|V|\rho^5)$
FS [24, 38]              $O(|E|h)$*
NSPDK [21]               $O(|V|)$**
$ODD_{ST_h}$ [27]        $O(|V| \log |V|)$*
$ODD_{ST+}$              $O(|V| \log |V|)$*

Table 1: Computational complexity of the Shortest Path, the 3-Graphlet, the
Fast Subtree, the NSPDK, the $ODD_{ST_h}$ and the $ODD_{ST+}$ kernels.
*: considering $\rho$ constant; **: with high constants.


Mahé and Vert [23] described a graph kernel based on extracting tree patterns
from the graph. The difference with respect to the approach of this paper is
that the tree patterns are obtained as the result of walks on the graph, i.e.
the same node can appear more than once in the same tree pattern. The
complexity of the kernel is $O(|V_1||V_2| h \rho^{2\rho})$, where $h$ is the
depth of the visit. Finally, [41] proposed the treelet kernel, based on
frequent pattern mining of tree substructures. The kernel implementation
considers subtrees with a maximum of 6 nodes, and its computational complexity
is $O(n\rho^5)$. Table 1 summarizes the computational complexity of some of
the kernels cited in this section and of the ones proposed in this paper.
Moreover, to give an idea of how many features are generated by a graph kernel
on a real-world dataset, Figure 4 reports the number of different features
generated on a chemical dataset (NCI1) by the most efficient of the
aforementioned kernels.

6. Experimental results 

6.1. Experiments on common benchmark graph datasets 

The experimental assessment of the proposed kernels has been performed
on a total of eight datasets. The first six datasets involve chemo- and
bioinformatics data: CAS^, CPDB [32], AIDS [4], NCI1, NCI109 [3] and GDD [2].


^http://www.cheminformatics.org/datasets/bursi

[Figure 4 plot: "Number of generated features for NCI1 dataset"; x axis: h]

Figure 4: Number of features generated by the $ODD_{ST_h}$, $ODD_{ST+}$, FS
and NSPDK kernels on the NCI1 dataset as a function of their parameter $h$.


The first five datasets involve chemical compounds and represent binary
classification problems. The nodes are labeled according to the atom type and
the edges represent the bonds. GDD is a dataset composed of proteins
represented as graphs, where the nodes represent amino acids and two nodes
are connected by an edge if they are less than 6 Å apart. Moreover, we adopted
from [42] two real-world image datasets: MSRC9-class and MSRC21-class^. Each
image is represented by its conditional Markov random field graph enriched
with semantic labels, and the task is scene classification. Both datasets are
multi-class single-label classification problems. For our experiments, we
adopted an SVM classifier [43]. For the multi-class problems, we adopted a
one-vs-one scheme. We compare the predictive abilities of the $ODD_{ST+}$
kernel and of the two proposed variants, $ODD^{TANH}_{ST_h}$ and
$ODD^{TANH}_{ST+}$, to those of the original $ODD_{ST_h}$ kernel [27], the
Fast Subtree kernel (FS) [24] and the Neighborhood Subgraph Pairwise Distance
Kernel (NSPDK) [21]. Moreover, we also report the performances of the
p-random walk kernel, a kernel that compares random walks up to length p in
two graphs (a special case of [34] and [35]), as a representative of the
family of kernels based on random walks, and of the graphlet kernel [40].
Note that the complexity of the graphlet kernel (when applied to labeled
graphs) is exponential in the size k of the graphlet. Because of that,
following [38], we restricted our experimentation to a value of k that allows
for an efficient computation of the kernel, i.e. k = 3.

^http://research.microsoft.com/en-us/projects/ObjectClassRecognition/

Dataset     graphs   pos(%)        avg nodes   avg edges
CAS           4337   55.36              29.9        30.9
CPDB           684   49.85              14.1        14.6
AIDS          1503   28.07              58.9        61.4
NCI1          4110   50.04              29.9        32.3
NCI109        4127   50.37              29.7        32.1
GDD           1178   58.65             284.3      2862.6
MSRC_9         221   multi-class        40.6        97.9
MSRC_21        563   multi-class        77.5       198.3
NCI123       40952   4.76               26.8        28.9
NCI_AIDS     42682   3.52               45.7        47.7

Table 2: Statistics of the CAS, CPDB, AIDS, NCI1, NCI109, GDD, MSRC_9,
MSRC_21, NCI123 and NCI_AIDS datasets: number of graphs, percentage of
positive examples, average number of nodes, average number of edges.

The experiments are performed using a nested 10-fold cross validation: for
each of the 10 folds, an inner 10-fold cross validation is performed to select
the best parameters for that particular fold. All the experiments have been
repeated 10 times using different splits for the cross validation, and the
average results (with standard deviation) are reported. For all the
experiments, the values of the parameters of the $ODD_{ST_h}$ and $ODD_{ST+}$
kernels, including their variants using tanh, have been restricted to
$\lambda \in \{0.1, 0.2, ..., 2.0\}$ and $h \in \{1, 2, ..., 10\}$. For the
Fast Subtree kernel, the only parameter, $h \in \{1, 2, ..., 10\}$, is
optimized. For the NSPDK, the parameters $h \in \{1, 2, ..., 8\}$ and
$d \in \{1, 2, ..., 7\}$ are optimized. Finally, for the p-random walk kernel
we selected $p \in \{1, 2, ..., 10\}$, and for the graphlet kernel we
considered only the graphlets of size 3, as mentioned above. A 10x10 CV test
with confidence level 95% (and 10 degrees of freedom) has been executed
between each pair of kernels on all datasets [44]. In the following, the term
significant refers to this statistical test. Table 3 reports the average
accuracies and the rankings obtained by the different kernels on the
considered datasets. The symbol * in Table 3 denotes, for each dataset, the
kernels whose performance difference with respect to the top-ranked kernel is
statistically significant.

Kernel                CAS                 CPDB                AIDS                NCI1
p-random walk         70.16* (8) ± 0.20   64.14* (8) ± 1.35   73.55* (8) ± 0.49   -
Graphlet              71.10* (7) ± 0.48   67.36* (7) ± 0.96   73.98* (7) ± 0.65   69.68* (7) ± 0.52
FS                    83.32* (6) ± 0.37   76.36 (5) ± 1.48    82.02 (5) ± 0.4     84.41 (4) ± 0.49
NSPDK                 83.60* (2) ± 0.34   76.99 (1) ± 1.15    82.71 (1) ± 0.66    83.45 (5) ± 0.43
$ODD_{ST_h}$          83.34* (4) ± 0.31   76.44 (4) ± 0.62    81.51 (6) ± 0.74    82.10* (6) ± 0.42
$ODD^{TANH}_{ST_h}$   83.40* (3) ± 0.41   76.56 (3) ± 0.97    82.51 (3) ± 0.52    84.57 (3) ± 0.43
$ODD_{ST+}$           83.90 (1) ± 0.33    76.30 (6) ± 0.23    82.06 (4) ± 0.70    84.97 (1) ± 0.47
$ODD^{TANH}_{ST+}$    83.33* (5) ± 0.34   76.74 (2) ± 1.81    82.54 (2) ± 0.75    84.81 (2) ± 0.41

Kernel                GDD                 NCI109              MSRC_9              MSRC_21
p-random walk         -                   -                   67.01* (7) ± 2.22   18.88* (8) ± 1.4
3-Graphlet            74.92 (6) ± 1.40    68.07* (7) ± 0.31   60.83* (8) ± 2.0    19.66* (7) ± 0.96
FS                    75.46 (3) ± 0.98    85.02 (1) ± 0.44    89.26* (6) ± 0.82   89.87 (6) ± 0.71
NSPDK                 74.09 (7) ± 0.91    84.17 (2) ± 0.33    89.48* (4) ± 1.0    90.24 (3) ± 0.49
$ODD_{ST_h}$          75.27 (5) ± 0.68    81.91* (6) ± 0.42   90.80 (3) ± 1.10    89.92 (5) ± 0.73
$ODD^{TANH}_{ST_h}$   76.09 (1) ± 0.85    83.68 (4) ± 0.39    94.39 (1) ± 1.21    92.60 (1) ± 0.45
$ODD_{ST+}$           75.33 (4) ± 0.81    83.08* (5) ± 0.49   89.33* (5) ± 1.2    89.94 (4) ± 0.80
$ODD^{TANH}_{ST+}$    75.52 (2) ± 0.88    83.93 (3) ± 0.42    92.99 (2) ± 1.26    91.74 (2) ± 0.77

Table 3: Average accuracy results ± standard deviation in nested 10-fold
cross validation for the p-random walk, the Graphlet, the Fast Subtree, the
Neighborhood Subgraph Pairwise Distance, the $ODD_{ST_h}$, the
$ODD^{TANH}_{ST_h}$, the $ODD_{ST+}$ and the $ODD^{TANH}_{ST+}$ kernels on the
CAS, CPDB, AIDS, NCI1, GDD, NCI109, MSRC_9 and MSRC_21 datasets. The rank of
the kernel is reported between brackets. The symbol * denotes the kernels
whose performance difference with respect to the top-ranked kernel is
statistically significant.
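
The nested cross-validation protocol used in this section can be sketched in
a few lines of Python. This is only an illustration under stated assumptions:
the function names are hypothetical, the `predict` callback stands in for
training an SVM on the training indices and predicting on the test indices,
and the toy threshold rule and its three-value grid replace the actual kernel
parameters.

```python
import random


def k_folds(indices, k, rng):
    """Shuffle the indices and deal them into k disjoint folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def accuracy(predict, param, train, test, y):
    predictions = predict(param, train, test)
    return sum(p == y[i] for p, i in zip(predictions, test)) / len(test)


def nested_cv(y, grid, predict, k=10, seed=0):
    """Outer folds estimate accuracy; inner folds select the parameter."""
    rng = random.Random(seed)
    outer_scores = []
    for test in k_folds(range(len(y)), k, rng):
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        inner = k_folds(train, k, rng)

        def inner_score(param):
            total = 0.0
            for validation in inner:
                skipped = set(validation)
                fit = [i for i in train if i not in skipped]
                total += accuracy(predict, param, fit, validation, y)
            return total / k

        # The test fold is never seen while the parameter is being chosen.
        best = max(grid, key=inner_score)
        outer_scores.append(accuracy(predict, best, train, test, y))
    return sum(outer_scores) / k


# Toy problem: the label is 1 exactly when x > 0.5.
rng = random.Random(1)
x = [rng.random() for _ in range(200)]
y = [value > 0.5 for value in x]


def threshold_rule(param, train, test):
    # Hypothetical stand-in for "fit on `train`, predict on `test`".
    return [x[i] > param for i in test]


print(nested_cv(y, [0.2, 0.5, 0.8], threshold_rule))
```

Repeating the procedure with different values of `seed` corresponds to the 10
repetitions with different splits over which the results are averaged.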

Let us now focus on the experimental results obtained for the six chemical
datasets. The kernels $ODD_{ST+}$, $ODD^{TANH}_{ST_h}$ and $ODD^{TANH}_{ST+}$
together achieve the best accuracy on three out of six datasets, and the
second best accuracy on two others. On the datasets on which the FS and NSPDK
kernels perform better than the ODD ones, i.e. CPDB, AIDS and NCI109, the
performance difference, at least with respect to the best performing ODD
kernel, is never significant. Note that $ODD_{ST+}$ performs significantly
better than NSPDK and FS on the CAS dataset. The variant employing the
hyperbolic tangent is always useful for the ST kernel, making it the best
performing kernel on GDD, and is able to boost the accuracy of $ODD_{ST+}$ on
the AIDS, CPDB, GDD and NCI109 datasets. The generally good results of the
ODD kernels with respect to FS and NSPDK may be attributed to the fact that
they are associated with a large feature space, which makes them more
adaptable to different tasks. Note that the execution of the p-random walk
kernel did not complete within 4 days on the NCI1, NCI109 and GDD datasets,
so the corresponding results are missing.

Let us now focus on the image datasets (MSRC_9 and MSRC_21). On these
datasets, the baselines FS, NSPDK and $ODD_{ST_h}$ and the proposed
$ODD_{ST+}$ kernel show very similar performances. The introduction of the
hyperbolic tangent weighting scheme is very beneficial: both
$ODD^{TANH}_{ST_h}$ and $ODD^{TANH}_{ST+}$ perform better than all the
baselines, with the former being the best performing kernel on both datasets.

The p-random walk kernel and the graphlet kernel show poor performances on
these datasets. We argue that this is because they are the only ones among
the considered kernels that do not consider all the neighbors of a node as a
feature.

Figures 5 and 6 report the computational times required by the $ODD_{ST_h}$,
$ODD_{ST+}$, NSPDK and FS kernels as a function of the parameter $h$, which
determines the size of the considered substructures, on the NCI1 and CAS
datasets, respectively.

All the experiments are performed on a PC with two Quad-Core AMD Opteron(tm)
2378 processors and 64GB of RAM. The proposed kernels have been implemented
in C++. In addition, we implemented a fast version of the FS kernel in C++.
All these kernels adopt a hashing function, similar in spirit to [45]. As for
the p-random walk and graphlet kernels, we adopted a publicly available
Matlab implementation^. Thus, the times for the p-random walk and the
graphlet kernels are reported just for a qualitative comparison. The time
needed to compute the kernel matrix for the $ODD_{ST+}$ kernel increases
roughly linearly with respect to the parameter $h$ for both datasets. As
expected, the constant factors are higher than those of $ODD_{ST_h}$, but
$ODD_{ST+}$ is faster than (or comparable to) NSPDK. Note that we do not
report the computational times for $ODD^{TANH}_{ST_h}$ and
$ODD^{TANH}_{ST+}$, since their computational requirements are basically the
same as those of the corresponding base kernels: the computation of the novel
weight function does not add a significant computational burden.

Moreover, in Table 4 we report the average computational time for a single
fold with the optimal parameters on the four largest datasets: CAS, AIDS,
NCI1 and GDD. The parameters influencing the speed of the kernel are reported
between brackets. In this case, we report the times for all the considered
kernels. Even when comparing the executions with the optimal parameters,
$ODD_{ST+}$ is faster than or comparable to NSPDK and $ODD_{ST_h}$ is faster
than or comparable to FS.

^http://www.di.ens.fr/~shervashidze/code.html

[Figure 5 plot: "Gram matrix computation for NCI1 dataset"; x axis: h]

Figure 5: Time needed to compute the kernel matrix for the $ODD_{ST_h}$, the
$ODD_{ST+}$, the NSPDK and the FS kernels, as a function of their parameter
$h$, on the NCI1 dataset.

[Figure 6 plot: "Gram matrix computation for CAS dataset"; x axis: h]

Figure 6: Time needed to compute the kernel matrix for the $ODD_{ST_h}$, the
$ODD_{ST+}$, the NSPDK and the FS kernels, as a function of their parameter
$h$, on the CAS dataset.

6.2. Experiments on full NCI datasets 

In this set of experiments, we analyze how the proposed kernels and their
competitors scale to bigger datasets. We considered two datasets, NCI123 and
NCI_AIDS, each with more than 40,000 examples (see Table 2).

In NCI123^, the growth inhibition of the MOLT-4 human leukemia tumor cell
line is measured as a screen for anti-cancer activity. For each compound, an
activity score of -LogGI50 is measured, where GI50 is the concentration of
the compound required for 50% inhibition of tumor growth. A compound is
classified as active (positive class) or inactive (negative class) if the
activity score is, respectively, above or below a specified threshold. The
dataset is composed of 40,952 examples. NCI_AIDS^ is an anti-HIV database
that contains 42,682 molecules, experimentally detected to protect (confirmed
active), moderately protect (confirmed moderate) or not protect (inactive)
CEM cells from HIV-1 infection. From these classes we derived a binary
classification problem, i.e. distinguishing inactive from confirmed and
moderately protective molecules.

^http://pubchem.ncbi.nlm.nih.gov/bioassay/123

Kernel                CAS              AIDS             NCI1             GDD
Graphlet              58"              54"              133"             1715"
p-random walk         70h (h=7)        35h (h=8)        - (h=-)          - (h=-)
FS                    13" (h=3)        5" (h=9)         28" (h=8)        17" (h=1)
NSPDK                 24" (h=2,d=6)    217" (h=8,d=6)   192" (h=5,d=4)   395" (h=2,d=6)
$ODD_{ST_h}$          18" (h=3)        56" (h=7)        44" (h=4)        29" (h=1)
$ODD^{TANH}_{ST_h}$   47" (h=5)        51" (h=6)        110" (h=6)       246" (h=2)
$ODD_{ST+}$           32" (h=4)        111" (h=8)       205" (h=1)       199" (h=1)
$ODD^{TANH}_{ST+}$    179" (h=8)       61" (h=5)        165" (h=4)       541" (h=2)

Table 4: Average time required for computing the kernel matrix for the
p-random walk, the Graphlet, the Fast Subtree, the Neighborhood Subgraph
Pairwise Distance, the $ODD_{ST_h}$, the $ODD^{TANH}_{ST_h}$, the $ODD_{ST+}$
and the $ODD^{TANH}_{ST+}$ kernels on the CAS, AIDS, NCI1 and GDD datasets
with the optimal kernel parameters (reported between brackets). Times are
given in seconds (") or hours (h).

Since these two datasets are unbalanced, for this set of experiments we
adopted the Area Under the Receiver Operating Characteristic curve (AUROC or
AUC) as performance measure, since it is suited to unbalanced datasets. The
experimental setup in this case differs from the one presented in Section
6.1. Indeed, when the number of examples is large, computing the Gram matrix
is unfeasible. In this case, for each considered kernel configuration, we
computed the explicit features (stored in a sparse format) associated with
each example. With this explicit feature representation, it is possible to
train a linear SVM^. Note that the computed solution is equivalent to the one
that can be found by a C-SVM applied to the kernel matrix generated by the
graph kernel. However, in this way it is possible to handle very large
datasets in a reasonable amount of time. A 10x10 CV test with confidence
level 95% (and 10 degrees of freedom) has been executed between each pair of
kernels on the two datasets [44]. Table 5 reports the AUC results obtained,
for the two considered datasets, by the kernels for which it is possible to
generate the explicit feature space representation of the input examples.

^http://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data
^In our implementation we adopted Liblinear [46].

Kernel                NCI123              NCI_AIDS
Graphlet              54.93* (7) ± 0.24   67.74* (7) ± 0.15
FS                    61.08* (6) ± 0.34   83.73* (5) ± 0.17
NSPDK                 62.45 (3) ± 0.39    83.80* (3) ± 0.23
$ODD_{ST_h}$          62.11 (4) ± 0.30    83.77* (4) ± 0.22
$ODD^{TANH}_{ST_h}$   62.76 (2) ± 0.21    85.56 (2) ± 0.23
$ODD_{ST+}$           61.70* (5) ± 0.36   83.36* (6) ± 0.30
$ODD^{TANH}_{ST+}$    63.20 (1) ± 0.29    85.64 (1) ± 0.15

Table 5: Average AUC results ± standard deviation in nested 10-fold cross
validation for the Graphlet, the Fast Subtree, the Neighborhood Subgraph
Pairwise Distance, the $ODD_{ST_h}$, the $ODD_{ST+}$, the
$ODD^{TANH}_{ST_h}$ and the $ODD^{TANH}_{ST+}$ kernels obtained on the NCI123
and NCI_AIDS datasets. The rank of the kernel is reported between brackets.
The symbol * denotes the kernels whose performance difference with respect to
the top-ranked kernel is statistically significant.

The combination of the techniques proposed in the paper, ST+ and tanh, leads
to the best performances on both datasets. The performance difference between
$ODD_{ST+}$ and $ODD^{TANH}_{ST+}$ is statistically significant on both
datasets. The use of tanh yields statistically significant improvements for
$ODD^{TANH}_{ST_h}$ on NCI_AIDS with respect to all other kernels except
$ODD^{TANH}_{ST+}$.
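
The reason a linear model trained on explicit features can replace a kernel
machine trained on the Gram matrix is that both compute the same decision
value. The following Python sketch checks this identity numerically on sparse
feature maps. The feature names, the coefficient values (which play the role
of the products $\alpha_i y_i$ of a trained SVM) and the function names are
hypothetical and chosen only for illustration.

```python
from collections import Counter


def sparse_dot(f, g):
    """Dot product of two sparse feature maps stored as {feature: value} dicts."""
    if len(f) > len(g):
        f, g = g, f
    return sum(value * g.get(key, 0) for key, value in f.items())


def primal_weights(coefficients, features):
    """w = sum_i c_i * phi(x_i), where c_i plays the role of alpha_i * y_i."""
    w = Counter()
    for c, phi in zip(coefficients, features):
        for key, value in phi.items():
            w[key] += c * value
    return w


# Three training examples and one test example, as sparse feature counts.
train = [{"a": 2, "b": 1}, {"b": 3, "c": 1}, {"a": 1, "c": 4}]
coefficients = [1, -2, 3]
x = {"a": 1, "b": 1, "c": 1}

# Primal form: one dot product with the weight vector.
primal = sparse_dot(primal_weights(coefficients, train), x)
# Dual form: a weighted sum of kernel evaluations k(x_i, x).
dual = sum(c * sparse_dot(phi, x) for c, phi in zip(coefficients, train))
print(primal, dual)
```

The primal form needs a single pass over the non-zero features of each
example, whereas the dual form needs one kernel evaluation per training
example, which is what makes the explicit representation attractive when the
number of examples is large.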

Figure 7 reports the average computational time required to perform the
learning procedure for a fixed kernel, as a function of the $h$ parameter,
for the NCI123 and NCI_AIDS datasets. This procedure comprises the feature
generation step and the training phase of the linear SVM model. We decided to
report the overall times here because the run-time of the linear SVM depends
on the characteristics of the kernel, and thus comparing only the feature
generation part would not be fair. With the considered learning procedure,
the number of non-zero features generated by the kernel influences the total
run-time. Indeed, the FS kernel is the fastest one, being the one that
generates the smallest number of features. The time required by the training
procedure grows almost linearly for $ODD_{ST_h}$, $ODD^{TANH}_{ST_h}$ and
$ODD_{ST+}$, while it grows more than linearly for $ODD^{TANH}_{ST+}$. Note,
however, that the latter is still faster than NSPDK. It is interesting to
note that NSPDK with d = 1 is slower than NSPDK with d = 7 on NCI123, even if
the latter has a larger feature space. In this case, the former kernel is
probably less discriminative, and thus the corresponding optimization problem
that the linear SVM must solve is more difficult.

Table 6 reports the computational time required to compute the different
kernels with the optimal parameters obtained by a 10-fold cross validation.
Note that higher computational times generally correspond to higher values
of the optimal $h$ parameter.

On the considered datasets, higher AUC corresponds to higher computational
times for the respective kernel. It is interesting to analyze the
relationship between AUC values and running times for non-optimal parameters,
i.e. to understand which kernel is the most convenient when there is a strict
time constraint to comply with. Figure 8 plots the performances of the
different kernels with respect to the time required to perform the training
procedure, for the NCI123 and NCI_AIDS datasets. On the NCI123 dataset,
$ODD^{TANH}_{ST_h}$ and $ODD^{TANH}_{ST+}$ have the highest points in the
plot starting from a runtime of approximately 400 seconds. Below that
computational time, NSPDK is the best performing kernel. On the other hand,
on the NCI_AIDS dataset, $ODD^{TANH}_{ST_h}$ and $ODD^{TANH}_{ST+}$ are the
best performing kernels for almost every time threshold.

Figure 7: Time needed to perform the whole training procedure, as a function
of $h$, for all the considered kernels on the NCI123 (left) and NCI_AIDS
(right) datasets.
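
The AUC used as performance measure in this section can be computed directly
from classifier scores as the probability that a randomly chosen positive
example is scored above a randomly chosen negative one. The Python sketch
below is a minimal illustration of that definition; the function name, the
toy scores and the quadratic pairwise loop are choices made for this example
and do not describe how the reported values were obtained.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y != 1]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    # Quadratic pairwise count, fine for a sketch; sort-based versions are faster.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


labels = [1] * 5 + [0] * 95     # roughly the class ratio of NCI123
constant = [0.0] * 100          # a classifier that ignores its input
print(auc(constant, labels))
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

A classifier that always predicts the negative class would reach 95% accuracy
on such a sample while its AUC stays at 0.5, which is why AUC is the more
informative measure on unbalanced datasets.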


7. Conclusions and future works 

The contribution of this paper is twofold. First, we propose a novel instance
of the ODD graph kernel based on a novel tree kernel, ST+. This constitutes an
example of how the generality of the framework can lead to the definition of
novel graph kernels that improve the state of the art. Second, we define a
novel, non-linear feature weighting scheme for the ODD kernels, which can in
principle be applied to any graph kernel with an explicit feature space
representation. As future work, we plan to apply this and other weighting
schemes to other state-of-the-art graph kernels. The experimental results
show that the proposed kernels have state-of-the-art performances on six
benchmark graph datasets from bioinformatics, and on two graph datasets for
image classification. Moreover, experiments on two large graph datasets show
that our approach is able to scale up to real-world sized datasets.

Figure 8: Relationship between the AUC (obtained in 10-fold cross validation)
and the time needed to perform the whole training procedure. A point is
reported for each combination of the $h$ and $C$ parameters, for all the
considered kernels on the NCI123 (left) and NCI_AIDS (right) datasets. Note
that the x axis is in log scale.

Table 6: Time needed to perform the whole training procedure with the optimal
parameter configuration (reported between brackets) for all the considered
kernels on the NCI123 and NCI_AIDS datasets.